1.  INTRODUCTION


The great technological progress embodied in very large scale
integration (VLSI) of electronic circuits has made it possible to
conceive large systems of processing elements cooperating in the
execution of parallel algorithms. This has motivated considerable
research interest in parallel computation. Unfortunately, here the
situation is very different from that of serial computation, where the
RAM machine [1] represents a universally accepted model. The difficulty
of choosing a specific interconnection is frequently bypassed by assuming
a model (shared-memory-machine) where each pair of processors is
connected (or an equivalent system) [2-5]. Although not without merit,
because it aims at uncovering the inherent data-dependence of given
problems, this approach ignores the technological constraints of VLSI,
particularly as regards the communication among the processing elements
[6]. At the opposite end, other workers [7-11] suggest that processor
interconnection should be limited to planar links between topologically
neighboring cells (arrays or meshes). Such designs are certainly well
suited for current VLSI technology, and they have cleverly been used in
implementing algorithms for matrices or graph problems [9-12], for
example. This type of connection, however, is not suited for efficiently
implementing algorithms for various fundamental problems, such as sorting
and convolution. Indeed, good algorithms for solving these problems
intrinsically require data movement between processors which are
topologically far apart; for example, sorting on an n processor array
such as ILLIAC IV requires time $\Omega(\sqrt{n})$ [8].

The purpose of the paper is to propose and analyze a new interconnection
of processors, called the cube-connected-cycles (CCC), which is
remarkably suited for implementing efficient algorithms such as the
Fast-Fourier-Transform (FFT), sorting, etc. The geometric structure
underlying the interconnections is that of the k-dimensional cube. This
structure, which has already been studied in relation to parallel
computation [13], is not readily usable for VLSI design, since each of
the $2^k$ processors is connected to k other processors.

By  combining  parallelism  and  pipelining  we  are  able  to  achieve  the 
following  results: 

(1)  The  number  of  connections  per  processor  is  reduced  to  3. 

(2)  Processing  time  is  not  significantly  increased  with  respect 
to  that  achievable  on  the  k-cube  structure. 

(3)  Programs for the individual modules are obtained in a systematic
way from a standard description of the global algorithms.

(4)  The overall structure complies with the basic requirements of
VLSI technology: modularity, ease of layout, simplicity of communication
among the processing elements, simplicity in timing and control of the
entire system [14]. We also propose a wire layout of the CCC, which can
be physically realized with two orthogonal layers of wires. This layout
is optimal for several problems, according to a recently proposed VLSI
model [18].

(5)  Finally we are able, without resorting to any drastic departure
from classical algol-like languages, to provide fully accurate and
hopefully easily understandable descriptions of our parallel programs.
This is a favorable sign that parallel processing may possibly be endowed
with suitable high level programming languages.

This paper is organized as follows. Section 2 introduces a class of
algorithms comprising many important applications, such as merging,
sorting, Fourier Transform, and data rearrangement. Section 3 presents
models of module connections, including the CCC, allowing for efficient
parallel execution of the algorithms in Section 2. Section 4 describes
the implementation of such algorithms on the CCC, and Section 5 is
devoted to optimality considerations regarding a layout of the machine
for VLSI realizations.

2.  A CLASS OF HIGHLY PARALLEL ALGORITHMS

To describe our algorithms, assume that input data
$t_0, t_1, \ldots, t_{n-1}$ are stored respectively in storage locations
T[0], T[1], ..., T[n-1], and that $n = 2^k$, i.e., the number of inputs
is a power of 2. We say that an algorithm is in the DESCEND class if it
performs a sequence of basic operations on data which are successively
$2^{k-1}, \ldots, 2^j, \ldots, 2^0 = 1$ locations apart. Each basic
operation OPER(m,j;U,V) modifies the two data items present in storage
locations U and V; the computation performed affects only the contents
of U,V and it may depend upon parameters m and j, which are integers
$0 \le m < n$, $0 \le j < k$.

Algorithms  in  the  DESCEND  class  are  then  specified  as: 
proc DESCEND
  for j ← k-1 step -1 until j = 0
  do foreach m: 0 ≤ m < n
     pardo if bit_j(m) = 0 then OPER(m, j; T[m], T[m+2^j])
           fi
     odpar
  od
corp DESCEND.

Here, $\mathrm{bit}_j(m)$ is the coefficient of $2^j$ in the binary
representation $m = \sum_{0 \le j < k} \mathrm{bit}_j(m)\,2^j$. The
language construct

    foreach m: <cond(m)> pardo <action> odpar

indicates that all instructions <action> corresponding to values of m
satisfying <cond(m)> can be performed simultaneously. On machines where
such parallelism can be realized, DESCEND algorithms run in
$k = \log_2 n$ elementary steps.
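The DESCEND schema can be sketched as a sequential simulation in Python. This is only an illustration under stated assumptions: the names `bit` and `descend`, and the convention that `oper(m, j, u, v)` returns the new contents of the two locations, are hypothetical and not part of the original text. Each pass of the inner loop stands in for one parallel step, which is legitimate because the pairs (m, m + 2^j) touched in a step are disjoint.

```python
def bit(m, j):
    """Coefficient of 2**j in the binary representation of m."""
    return (m >> j) & 1


def descend(T, oper):
    """Sequentially simulate a DESCEND-class algorithm on a copy of T.

    Assumption of this sketch: oper(m, j, u, v) returns the pair (u', v')
    that replaces the contents of locations m and m + 2**j.
    """
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "the number of inputs must be a power of 2"
    T = list(T)
    for j in range(k - 1, -1, -1):          # distances 2^(k-1), ..., 2^0
        for m in range(n):                  # one simulated parallel step
            if bit(m, j) == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T
```

Passing an `oper` that merely records its arguments shows the k steps and the n/2 disjoint pairs handled in each of them.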

We also introduce the dual class ASCEND, where the control of the
algorithm is changed to

    for j ← 0 step 1 until j = k-1,

i.e., OPER is performed on data which are successively
$1 = 2^0, 2^1, \ldots, 2^j, \ldots, 2^{k-1}$ locations apart. To clarify
the duality between ASCEND and DESCEND consider the binary representation
$m = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^i$ and define
$\bar{m} = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^{k-1-i}$, the
integer whose binary representation is the reversal of that of m. Once k
is fixed, the function $m \to \bar{m}$ is an involutory permutation of
$0, 1, \ldots, 2^k - 1$ known as the bit reversal permutation (BRP). For
example, for k = 3, the BRP of (0 1 2 3 4 5 6 7) is (0 4 2 6 1 5 3 7).
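The bit reversal permutation is easy to check numerically. The following sketch (the function names `bit_reverse` and `brp` are hypothetical) computes the reversed index directly from its definition and applies the permutation to a sequence.

```python
def bit_reverse(m, k):
    """Integer whose k-bit binary representation is the reversal of m's."""
    r = 0
    for i in range(k):
        r |= ((m >> i) & 1) << (k - 1 - i)
    return r


def brp(seq):
    """Apply the bit reversal permutation to a sequence of length 2**k."""
    n = len(seq)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    return [seq[bit_reverse(m, k)] for m in range(n)]
```

Applying `brp` twice returns the original sequence, which is the involution property mentioned above.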


By first applying the BRP to its inputs, an ASCEND algorithm can be
transformed into a dual DESCEND algorithm (figure 1) whose basic operation
OPER' is related to the original OPER by:

$$\mathrm{OPER}'(m, j; U, V) = \mathrm{OPER}(\bar{m},\, k-1-j;\, U, V)$$

Figure 1.  Dual algorithms; operands are denoted by their original
addresses, connecting lines show interacting operands,
and priming indicates the number of operations through
which an operand has been processed. (Left panel: DESCEND on input
0 1 2 3 4 5 6 7, steps j = 2, 1, 0. Right panel: ASCEND on input
0 4 2 6 1 5 3 7, steps j = 0, 1, 2.)

It is now time to exhibit algorithms for solving specific interesting
problems. Some applications - such as bitonic merge and cyclic shift -
are directly within the ASCEND or DESCEND classes (simple algorithms);
for these applications, all we have to do is specify OPER(m,j;U,V).

Other applications (such as permutation, shuffle, unshuffle,
bit-reversal (BRP), odd-even-merge, Fast-Fourier-Transform, convolution,
matrix transposition) have programs consisting of a short sequence
of algorithms (cascaded algorithms) in the preceding class, and thus
run in O(log n) parallel steps.

We also have applications - such as bitonic sort, odd-even-sort,
and calculations of symmetric functions - for which the combining step
of the two results of a recursive call is itself an algorithm in one
of the two preceding categories. These algorithms, which we call
composite, run in $O((\log n)^2)$ parallel steps.

2.1  Bitonic Merge

The elegant algorithm for bitonic merge, due to K. E. Batcher
[15], is ideally suited for implementation within the DESCEND class.
All we need is to specify OPER(m,j;U,V) as a comparison-exchange.
Precisely, in order to handle sequences which are sorted either in
increasing or in decreasing order, we define ORIENTCOMPEXCHANGE(m,j;U,V)
as

    if bit_j(m) = 0 then (U,V) ← (min(U,V), max(U,V))
    else (U,V) ← (max(U,V), min(U,V))
    fi.
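As an illustration, a bitonic merge can be sketched as a DESCEND computation whose basic operation is a compare-exchange. The sketch assumes the plain increasing variant (the min/max branch above) and a sequential simulation of the parallel steps; the names `descend` and `bitonic_merge` are hypothetical.

```python
def descend(T, oper):
    """Sequential stand-in for DESCEND; oper returns the new (U, V)."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    T = list(T)
    for j in range(k - 1, -1, -1):
        for m in range(n):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def bitonic_merge(T):
    """Sort a bitonic sequence into increasing order in log2(n) steps."""
    return descend(T, lambda m, j, u, v: (min(u, v), max(u, v)))


# An increasing run followed by a decreasing run is bitonic.
merged = bitonic_merge([1, 4, 6, 7, 5, 3, 2, 0])
```

Here `merged` is the fully sorted sequence; on an input that is not bitonic the same three steps do not sort in general.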

Batcher's  odd-even  merge  [15,16]  can  also  be  programmed  as  a  cascaded 
algorithm,  running  in  O(logn)  parallel  steps. 

2.2  Radix-2  Fast-Fourier-Transforms  and  Convolution 

The important FFT algorithm can be set in the ASCEND class. Let
$\omega$ be a primitive root of unity of order $n = 2^k$. If
$\langle A_0, \ldots, A_{n-1} \rangle$ is the Fourier Transform of vector
$\langle a_0, \ldots, a_{n-1} \rangle$, it is well-known that
$A_j = U_j + \omega^j V_j$ and $A_{j+2^{k-1}} = U_j - \omega^j V_j$,
where the U's and V's are respectively the Fourier Transforms, with
primitive root $\omega^2$, of the "even" subsequence
$\langle a_0, a_2, \ldots, a_{2^k-2} \rangle$ and the "odd" subsequence
$\langle a_1, a_3, \ldots, a_{2^k-1} \rangle$; we call the $\omega^j$'s
the combining root powers.

The above relationships indicate that the sequence
$\langle a_0, \ldots, a_{n-1} \rangle$ must be initially rearranged by
means of the bit-reversal permutation. Once the desired reconfiguration
has been achieved, we may proceed with the actual FFT computation, which
is in the ASCEND class. Its basic operation OPER(m,j;U,V) is specified by

    (U,V) ← (U + αV, U - αV)

where $\alpha$ is the combining root power, a power of $\omega$
determined by m and j. It is not hard to show that $\alpha$ can be
computed efficiently at each step; precisely, the time used by each
module to compute, by successive squaring, the required combining root
powers for the entire algorithm is $O((\log\log n)^2) = o(\log n)$.
Using a sequence of two inverse Fourier transforms in the classical
manner [1] allows one to compute the convolution of two sequences, from
which a wealth of applications can be derived (see [1]).
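The FFT scheme described above (bit reversal followed by an ASCEND pass) can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the root of unity is taken as exp(-2πi/n), and the combining root power at step j is taken to be the standard radix-2 exponent, α = ω^((m mod 2^j)·2^(k-1-j)). The function names are hypothetical.

```python
import cmath


def bit_reverse(m, k):
    """Reverse the k-bit binary representation of m."""
    r = 0
    for i in range(k):
        r |= ((m >> i) & 1) << (k - 1 - i)
    return r


def fft_ascend(a):
    """Radix-2 FFT as BRP followed by an ASCEND pass (sequential sketch)."""
    n = len(a)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    omega = cmath.exp(-2j * cmath.pi / n)       # assumed sign convention
    T = [a[bit_reverse(m, k)] for m in range(n)]
    for j in range(k):                          # ASCEND: distances 1, 2, 4, ...
        for m in range(n):
            if (m >> j) & 1 == 0:
                # assumed combining root power for the pair (m, m + 2^j)
                alpha = omega ** ((m % (1 << j)) << (k - 1 - j))
                u, v = T[m], T[m + (1 << j)]
                T[m], T[m + (1 << j)] = u + alpha * v, u - alpha * v
    return T


def naive_dft(a):
    """Direct O(n^2) evaluation, used only as a reference."""
    n = len(a)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(a[s] * w ** (s * t) for s in range(n)) for t in range(n)]
```

Comparing `fft_ascend` with `naive_dft` on a small vector confirms that the last ASCEND step realizes the relations for A_j and A_{j+2^(k-1)} given above.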

2.3  Data Rearrangements

Being able to efficiently permute the data is obviously important
for many applications. For example, the BRP rearrangement is a necessary
preliminary step to the FFT algorithm of the preceding section. Some
permutations, such as cyclic shifts, shuffle, and unshuffle can be
computed by algorithms in ASCEND or DESCEND, as the reader will
enjoy discovering for himself (here "shuffle" of
$(0, 1, 2, \ldots, 2^k - 1)$ is
$(0, 2^{k-1}, 1, 2^{k-1}+1, \ldots, 2^{k-1}-1, 2^k-1)$ and "unshuffle"
is the inverse permutation). Other permutations, such as BRP or matrix
transpose, are computed by cascaded algorithms. In general, we can
emulate a Benes permutation network [21] by a sequence ASCEND;DESCEND,
thus in time O(log n); it must be pointed out, however, that to realize
an arbitrary permutation, the exchange information must be precomputed.

2.4  Sorting and Calculation of Symmetric Functions

The previously described merge routines can be used as the basis of
efficient sorting algorithms. A sequence of input keys is divided into
two halves, each of which is recursively sorted (in opposite order in
the case of bitonic sort), and then merged using either of the above
merge routines. Both algorithms run in time $O((\log n)^2)$.
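The composite structure of bitonic sort can be sketched as below: the two halves are sorted recursively in opposite orders, and the combining step is a bitonic merge, i.e., a DESCEND-style sequence of compare-exchanges. This is a sequential illustration with hypothetical function names, not the per-module program.

```python
def bitonic_merge(T, ascending=True):
    """DESCEND-style merge of a bitonic sequence of length 2**k."""
    T = list(T)
    n = len(T)
    d = n >> 1
    while d >= 1:                         # distances n/2, n/4, ..., 1
        for m in range(n):
            if m & d == 0:
                lo, hi = min(T[m], T[m + d]), max(T[m], T[m + d])
                T[m], T[m + d] = (lo, hi) if ascending else (hi, lo)
        d >>= 1
    return T


def bitonic_sort(T, ascending=True):
    """Sort the halves in opposite orders, then merge the bitonic result."""
    n = len(T)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    if n == 1:
        return list(T)
    half = n // 2
    left = bitonic_sort(T[:half], True)
    right = bitonic_sort(T[half:], False)
    return bitonic_merge(left + right, ascending)
```

With log n levels of recursion and a merge of at most log n steps at each level, the parallel step count is of the order (log n)^2, in line with the bound stated above.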

One can compute symmetric functions in a completely analogous
fashion: apply recursive calls to each half of the data, and compute the
convolution of the two resulting sequences, again in time
$O((\log n)^2)$.

2.5  Matrix  Multiplications  and  Other  Algorithms 

To compute the matrix product C = A × B of two n × n matrices, we
must obviously first store $A = (A_0^T \ldots A_{n-1}^T)^T$ in row major
order, and $B = (B_0 \ldots B_{n-1})$ in column major order. Assuming we
have enough space and processors, i.e., $2^k \ge n^3$, we copy A and B
into the pattern:

$$A_0 B_0\, A_0 B_1 \ldots A_0 B_{n-1}\, A_1 B_0 \ldots A_i B_j \ldots A_{n-1} B_{n-1}.$$

All this can be achieved with simple-minded cascaded algorithms, in time
O(log n). Each of the scalar products
$c_{i,j} = A_i \cdot B_j = \sum_k a_{i,k} \cdot b_{k,j}$ is computed in
parallel, within O(log n) additional time units. The results $c_{i,j}$
are then regrouped, according to the output format (say, row major).

Although the details of this algorithm are a bit tedious to describe,
it should be clear that matrix multiplication can be computed in time
O(log n), within our class of algorithms. In fact, a surprising number
of other algorithms can be efficiently implemented within this framework,
including all of the interesting algorithms for parallel processing known
to the authors.



3.  DESCRIPTION  OF  THE  SCHEME 

In order to efficiently implement algorithms in the DESCEND class,
the most natural interconnection of modules is that of the k-dimensional
binary cube (k-cube) where each of the $2^k$ processors is numbered from
0 to $2^k - 1$ and is connected to each of the k processors whose binary
numbering differs in exactly one binary position (figure 2). Although
an ASCEND or DESCEND algorithm can be implemented on such a machine in
$\log_2 n$ parallel steps, this proposal is not feasible mainly because
the number $k = \log_2 n$ of connections for each processor is too large.
The unfolded k-cube and the perfect shuffle interconnections have been
proposed [17] (figure 3), as attempts to remedy this difficulty.


Figure 2.  The 3-cube.

Figure 3.  Unfolded 3-cube (left) and perfect shuffle (right)
interconnections.

Although both structures have a fixed number (4) of connections per
processor, their intrinsic topology makes them inferior, as regards
physical layout (see section 5), to the scheme we now describe.

Our parallel computing system, the cube-connected-cycles (CCC), is
a network of identical processors, called modules. A module has 3
interconnection ports. Each interconnection line linking two modules can
be used for the bidirectional transmission of one operand, and it is
irrelevant here whether operand transmission is serial or parallel. For
correctly executing the algorithms described in the preceding sections,
it is immaterial whether the entire system is synchronized through a
central clock, which defines time units for all modules, or whether
synchronization problems are settled at the level of each communication
line, thus achieving a globally asynchronous system. In order to describe
the interconnections, we assume for simplicity that n, the number of
modules, is a power of two, i.e., $n = 2^k$, and, moreover, assume that
k is of the form $k = r + 2^r$; the modifications resulting when k is
arbitrary are straightforward (in the latter case, r is the smallest
integer for which $r + 2^r \ge k$). Each module has a k-bit address m
which in turn is expressed as a pair $(\ell, p)$ of integers represented
with (k-r) and r bits respectively, such that $\ell \cdot 2^r + p = m$.

As mentioned earlier, each module has three ports: F, B, and L
(mnemonic for forward, backward, lateral), whose connection is entirely
determined by the module address $(\ell, p)$, that is:

    F(ℓ,p) is connected to B(ℓ, (p+1) mod 2^r)
    B(ℓ,p) is connected to F(ℓ, (p-1) mod 2^r)
    L(ℓ,p) is connected to L(ℓ + ε·2^p, p)

where $\varepsilon = 1 - 2\,\mathrm{bit}_p(\ell)$. The interconnection
scheme is displayed in figure 4. In words, the modules are grouped into
$2^{k-r}$ cycles, each cycle consisting of $2^r$ modules, cyclically
connected by the F-B lines. The cycles are in turn interconnected as a
(k-r)-cube; if $\langle x_0, x_1, \ldots, x_{k-r-1} \rangle$ are the
dimensions of the (k-r)-cube, all edges along dimension $x_i$, called
collectively sheaf i, link modules whose addresses are (., i). The total
number of interconnection links is $3 \cdot 2^{k-1}$.
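The connection rules can be checked with a small enumeration. The sketch below builds every link of a CCC with $k = r + 2^r$ and represents a link as the pair of (module, port) endpoints it joins; the function name `ccc_links` and this representation are assumptions made for illustration.

```python
def ccc_links(r):
    """Enumerate the links of the CCC with k = r + 2**r.

    Returns (k, links); each link is a frozenset of two (module, port)
    endpoints, with module = l * 2**r + p and port in {"F", "B", "L"}.
    """
    size = 1 << r                 # modules per cycle
    k = r + size
    links = set()
    for l in range(1 << (k - r)):             # cycle index
        for p in range(size):                 # position within the cycle
            m = l * size + p
            forward = l * size + (p + 1) % size
            links.add(frozenset({(m, "F"), (forward, "B")}))
            eps = 1 - 2 * ((l >> p) & 1)      # +1 if bit_p(l) = 0, else -1
            lateral = (l + eps * (1 << p)) * size + p
            links.add(frozenset({(m, "L"), (lateral, "L")}))
    return k, links
```

Counting the endpoints per module confirms that every module uses exactly its three ports, and the number of links agrees with $3 \cdot 2^{k-1}$.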

Each module contains an operand register T, a few memory locations,
and possesses basic arithmetic and logical capabilities. It is controlled
by a stored program or a circuit implementation of such a program.
For the time being, we make the hypothesis of unlimited parallelism,
that is, the number of modules is tailored to the problem size; under
this hypothesis, the one or two memory locations mentioned earlier
suffice. Subsequently (section 4.3), under the hypothesis of limited
parallelism, we shall endow each module with a small private random
access memory. In either case, each module is somewhat simpler than
a current microprocessor but not basically different from it.

Figure 4.  The CCC interconnection scheme.

4.  EMULATION OF THE k-CUBE ON THE CCC

In order to implement DESCEND on the CCC, we prune the k-cube so
as to use only connections existing in the CCC. The first stage consists
in removing the sheaves corresponding to dimensions 0, 1, ..., r-1, and
using instead the cycle connections F and B, as introduced in section 3.

Our original DESCEND program is thus transformed to:
proc DESCEND
  for j ← k-1 step -1 until j = r
  do foreach m: 0 ≤ m < n
     pardo if bit_j(m) = 0 then OPER(m, j; T[m], T[m+2^j])
           fi
     odpar
  od;
  foreach ℓ: 0 ≤ ℓ < 2^(k-r) pardo LOOPOPER(ℓ) odpar
corp DESCEND.

Here procedure LOOPOPER(ℓ) processes the data within cycle ℓ to compute
the desired result in $O(2^r)$ parallel steps, as we show later. Note
that the running time is still $O(k-r) + O(2^r) = O(\log n)$.

The second transformation consists in removing, for all
j = 0, ..., k-r-1, the k-cube links pertaining to sheaf (r + j), except
those existing between modules whose addresses are of the form (., j):
the resulting interconnection is then exactly the one of the CCC, as
introduced in Section 3.

The computation corresponding to the for loop of the above
algorithm can no longer be performed in one parallel step. Using
repeated circular shifts within cycles, however, each operand in the
cycle can be successively brought to reside for one time unit in module
(., j), where OPER(., j; ., .) can then be executed. Although the
execution of OPER(., j; ., .) for all operands in a cycle now requires
$2^r$ time units, this computation can be pipelined (overlapped) with the
analogous operations OPER(., i; ., .) for $r \le i < k$. To achieve
pipelining thus requires a new function BSHIFT(ℓ), which performs a
cyclic backward shift of the operands in cycle ℓ, that is:

    foreach j: 0 ≤ j < 2^r
    pardo T[ℓ·2^r + ((j-1) mod 2^r)] ← T[ℓ·2^r + j] odpar.

The final version of DESCEND is thus:
proc DESCEND
  for i ← 2^r - 1 step -1 until i = -2^r
  do foreach ℓ: 0 ≤ ℓ < 2^(k-r)
     pardo foreach p: max(i,0) ≤ p < min(2^r, 2^r + i)
           pardo if bit_p(ℓ) = 0 then OPER(a, b; U, V)
                    where a = ℓ·2^r + ((p-i-1) mod 2^r),
                          b = p + r,
                          U = T[ℓ·2^r + p],
                          V = T[(ℓ+2^p)·2^r + p].
                 fi
           odpar;
           BSHIFT(ℓ) Comment backwards shift of cycle ℓ;
     odpar
  od;
  Comment end of treatment on sheaves k-1, k-2, ..., r;
  foreach ℓ: 0 ≤ ℓ < 2^(k-r) pardo LOOPOPER(ℓ) odpar
corp DESCEND.
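The pipelined treatment of sheaves k-1, ..., r can be checked against the direct k-cube execution with a small simulation. The sketch below is a sequential stand-in under stated assumptions: all function names are hypothetical, `oper(m, j, u, v)` returns the new pair of operands, all operations of one time unit are applied before the shift of that time unit, and the original address of the operand found at position p after the shifts performed so far is taken as $\ell \cdot 2^r + ((p - i - 1) \bmod 2^r)$.

```python
def bit(m, j):
    return (m >> j) & 1


def descend_high_kcube(T, k, r, oper):
    """Steps j = k-1, ..., r of DESCEND executed directly on the k-cube."""
    T = list(T)
    for j in range(k - 1, r - 1, -1):
        for m in range(len(T)):
            if bit(m, j) == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def bshift(T, r):
    """Cyclic backward shift inside every cycle of 2**r modules."""
    size = 1 << r
    out = list(T)
    for base in range(0, len(T), size):
        for j in range(size):
            out[base + (j - 1) % size] = T[base + j]
    return out


def descend_high_ccc(T, k, r, oper):
    """Pipelined execution of the same steps using lateral links only."""
    T = list(T)
    size = 1 << r
    for i in range(size - 1, -size - 1, -1):       # i = 2^r - 1, ..., -2^r
        for l in range(1 << (k - r)):
            for p in range(max(i, 0), min(size, size + i)):
                if bit(l, p) == 0:
                    a = l * size + (p - i - 1) % size   # original address
                    u = l * size + p
                    v = (l + (1 << p)) * size + p
                    T[u], T[v] = oper(a, p + r, T[u], T[v])
        T = bshift(T, r)                           # one shift per time unit
    return T
```

After the $2^{r+1}$ shifts every operand is back in its own module, so the two results can be compared position by position, for instance with an operation that depends on m and j and is not symmetric in U and V.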
The inner operation of the for loop is executed in two time units;
one for OPER, then one for BSHIFT. The total running time is thus
$4 \cdot 2^r$ plus the time for executing LOOPOPER. If we can ensure that
LOOPOPER can be processed in time linear in the cycle size, the entire
procedure will be executed on the CCC in time O(log n).

Figure 5 provides a schematic view of DESCEND on the CCC, and
conventions used are those of figure 1, which depicts DESCEND on the
k-cube. Here we assume k = 3, thus the CCC consists of 4 cycles of
length 2.

Figure 5.  DESCEND on the CCC, k = 3


4.1  Computation Within the Cycles

The next question to be addressed is the implementation of
LOOPOPER(ℓ), so that it runs in time linear in the cycle length.
Obviously, we are constrained to using only the F and B cycle links
existing in the CCC. Our objective is to emulate, on the cycle of
length $2^r$, the operation OPER as it would be executed on hypothetical
r-cube sheaves. Since OPER may take place in the cycle only between
adjacent modules, particular care must be exercised to ensure that the
desired adjacencies, corresponding to all sheaves, be globally
realized in time linear in the cycle length. The key permutations for
this task are based on the perfect unshuffle [16,17]. Specifically,
UNSHUFFLE(ℓ,i) performs the perfect-unshuffle operation on each of
the $2^{r-i-1}$ contiguous blocks of length $2^{i+1}$ into which
T[ℓ·2^r :: (ℓ+1)·2^r - 1] is subdivided, and is realized as follows:

proc UNSHUFFLE(ℓ,i)
  for b ← 2^i step -1 until b = 2
  do foreach m: m = ℓ·2^r + (2·s+1)·2^i + p
        where 0 ≤ s < 2^(r-i-1), -b < p < b,
              (p mod 2) = (b mod 2)
     pardo T[m-1] ↔ T[m] odpar
  od
corp UNSHUFFLE.

Clearly, UNSHUFFLE(ℓ,i) runs in $2^i - 1$ parallel steps. It is also
easy to realize that the program

proc BRP(ℓ)
  for i ← r-1 step -1 until i = 1 do UNSHUFFLE(ℓ,i) od
corp BRP

realizes the bit-reversal permutation of T[ℓ·2^r :: (ℓ+1)·2^r - 1] with
reference to the r least-significant bits of the addresses.
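The two procedures can be simulated on a single cycle to check both the resulting permutation and the step count. The sketch below (hypothetical names `unshuffle` and `brp_cycle`) reads the basic action as an exchange of the contents of two adjacent modules and treats the list T as the contents of one cycle of length $2^r$; both are assumptions of the illustration.

```python
def unshuffle(T, r, i):
    """UNSHUFFLE on one cycle of length 2**r, by adjacent exchanges.

    Returns (new contents, number of parallel steps used).
    """
    T = list(T)
    steps = 0
    for b in range(1 << i, 1, -1):                 # b = 2^i, ..., 2
        for s in range(1 << (r - i - 1)):          # one block per value of s
            centre = (2 * s + 1) * (1 << i)
            for p in range(-b + 1, b):
                if (p - b) % 2 == 0:               # p and b of equal parity
                    m = centre + p
                    T[m - 1], T[m] = T[m], T[m - 1]
        steps += 1
    return T, steps


def brp_cycle(T, r):
    """Bit reversal of one cycle as UNSHUFFLE(r-1), ..., UNSHUFFLE(1)."""
    for i in range(r - 1, 0, -1):
        T, _ = unshuffle(T, r, i)
    return T
```

Within one value of b the exchanged pairs are disjoint, which is why each pass of the outer loop counts as a single parallel step.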


We can now elucidate the general format of LOOPOPER, which consists
of a sequence of unshuffle-operation pairs, each emulating a sheaf
operation. This is preceded by BRP, so that, upon completion, the results
are in the correct order (see figure 6). In the description below
the parameter a gives the original address of the operand which is
brought to module (ℓ, q) by the sequence: BRP; UNSHUFFLE(ℓ,0);
UNSHUFFLE(ℓ,1); ...; UNSHUFFLE(ℓ,r-1-j). (Recall that $\bar{q}$ denotes
the integer whose binary representation is the reversal of that of the
integer q.)

proc LOOPOPER(ℓ)
  BRP(ℓ);
  for j ← r-1 step -1 until j = 0
  do foreach q: 0 ≤ q < 2^r, bit_0(q) = 0
     pardo OPER(a, j; T[ℓ·2^r + q], T[ℓ·2^r + q + 1])
           where a = ℓ·2^r + (q̄ mod 2^j) + (q mod 2^(r-j))·2^j.
     odpar;
     UNSHUFFLE(ℓ,j)
  od
corp LOOPOPER.
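The control structure of LOOPOPER can be exercised on one cycle with the following sketch. It makes two simplifying assumptions, both departures from the text: the unshuffle is performed directly by slicing rather than by adjacent exchanges, and each operand carries its original address with it, so the address passed to OPER is read from the operand instead of being computed from a closed form. All names are hypothetical.

```python
def unshuffle_blocks(cells, i):
    """Perfect unshuffle of every contiguous block of length 2**(i+1)."""
    size = 1 << (i + 1)
    out = []
    for start in range(0, len(cells), size):
        block = cells[start:start + size]
        out.extend(block[0::2] + block[1::2])
    return out


def loopoper(T, r, oper):
    """BRP, then (operate on adjacent pairs; unshuffle) for j = r-1, ..., 0."""
    cells = list(enumerate(T))            # (original address, value)
    for i in range(r - 1, 0, -1):         # BRP
        cells = unshuffle_blocks(cells, i)
    for j in range(r - 1, -1, -1):
        for q in range(0, 1 << r, 2):
            (a, u), (b, v) = cells[q], cells[q + 1]
            assert b == a + (1 << j)      # neighbours are 2^j apart originally
            u, v = oper(a, j, u, v)
            cells[q], cells[q + 1] = (a, u), (b, v)
        cells = unshuffle_blocks(cells, j)
    assert [a for a, _ in cells] == list(range(1 << r))   # back in order
    return [v for _, v in cells]


def descend(T, k, oper):
    """Reference: DESCEND executed directly on k-cube connections."""
    T = list(T)
    for j in range(k - 1, -1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T
```

The internal assertions check that, at step j, adjacent modules hold operands whose original addresses differ by $2^j$, and that the operands end up in their original order.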
Figure 6.  A schematic presentation of LOOPOPER for r = 3.

With respect to execution time, we noted that UNSHUFFLE(ℓ,i) runs
in time $O(2^i)$; thus BRP and LOOPOPER jointly run in
$O(1 + 2 + 2^2 + \cdots + 2^{r-1}) = O(2^r)$ steps, linear in the cycle
length.

4.2  Programs for each Module of the CCC

From the preceding global description of DESCEND, it is rather
straightforward to produce the sequential program of module (ℓ,p). The
program MODULE(ℓ,p) for a given DESCEND algorithm is of the form:
HIGHSHEAVES(ℓ,p); LOWSHEAVES(ℓ,p), which respectively implement the
(k-r)-cube operation and LOOPOPER. The entire MODULE(ℓ,p) is of a
very simple nature: it basically counts up time and, at each time unit
numbered t, it tests a simple logical condition involving ℓ, p, and t;
depending on this test, either it does nothing, or it exchanges operands,
or it exchanges operands and performs an operation on them. The details
of these programs are omitted for the sake of brevity.

The precise execution time of DESCEND (or ASCEND) on the CCC is
given by the formula:

$$T_{CCC} = 4 \cdot 2^r \cdot T + (r + 2^r)\, T_{oper}$$

where T is the time required for stepping up the control variable t,
testing it and performing one data exchange on some of the links;
$T_{oper}$ is the time required for computing OPER(m,j;U,V) within each
module.

4.3  Limited Parallelism

So far, we have assumed that the size n of the CCC was tailored to
the application. To cope with the realistic situation where the number
N of inputs is larger than the size n of the CCC, we suggest letting
each module of the CCC be a full-fledged microprocessor endowed with a
private RAM memory.

Assuming for simplicity that N = sn, with $s = 2^q$ integer, we
require that the RAM memory of each module be of size s and denote
by T[m, 0::2^q - 1] the private memory locations of module m. The input
$a_0, \ldots, a_{N-1}$ is divided into consecutive blocks of size s, each
block being stored within a module of the CCC, so that
$T[m,j] = a_{2^q \cdot m + j}$ for $0 \le j < 2^q$.

The only modification concerns the program MODULE(ℓ,p) (see
Section 4.2), which now assumes the format HIGHSHEAVES(ℓ,p);
LOWSHEAVES(ℓ,p); LOCAL(ℓ,p). Programs for HIGHSHEAVES and LOWSHEAVES are
the same as before, except that each operation and data transmission is
now successively performed on the $2^q$ data items of each module. As
for LOCAL:

proc LOCAL(ℓ,p)
  u ← m·2^q
  for j ← q-1 step -1 until j = 0
  do for i ← 0 step 1 until i = 2^q - 1
     do if bit_j(i) = 0
        then OPER(u+i, j; T[m,i], T[m,i+2^j]) fi
     od
  od
corp LOCAL.
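The role of LOCAL can be illustrated by splitting a DESCEND computation on N = sn inputs into the steps that pair items held by different modules and the steps that stay inside one module's private memory. In this sketch the inter-module steps are simply emulated on the flat array (the CCC emulation of those steps is not reproduced), and the names `descend_range`, `local`, and `limited_descend` are hypothetical.

```python
def descend_range(T, hi, lo, oper):
    """DESCEND steps j = hi, hi-1, ..., lo applied to a copy of T."""
    T = list(T)
    for j in range(hi, lo - 1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def local(mem, m, q, oper):
    """Sequential LOCAL program of module m on its 2**q private items."""
    mem = list(mem)
    u = m << q                               # global address of mem[0]
    for j in range(q - 1, -1, -1):
        for i in range(1 << q):
            if (i >> j) & 1 == 0:
                mem[i], mem[i + (1 << j)] = oper(u + i, j, mem[i], mem[i + (1 << j)])
    return mem


def limited_descend(a, n, oper):
    """DESCEND on N = len(a) inputs spread over n modules (n divides N)."""
    N = len(a)
    s = N // n
    q = s.bit_length() - 1
    K = N.bit_length() - 1
    T = descend_range(a, K - 1, q, oper)     # steps crossing module borders
    out = []
    for m in range(n):                       # steps inside each module
        out.extend(local(T[m * s:(m + 1) * s], m, q, oper))
    return out
```

Each module performs its q local steps on s items sequentially, which is where the factor N/n in the running time comes from.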

It should be clear by now that all of the algorithms described in
Section 2 can be applied here. A direct analysis shows that, on a CCC
consisting of n processors, each processor having memory N/n, we can
process N inputs in time $O(\frac{N}{n} \log N)$ for algorithms in the
classes ASCEND or DESCEND, thus achieving the optimal speed-up possible
with n processors.
5.  LAYOUT OF THE CCC FOR VLSI

It is interesting to examine the just described CCC within the
framework of the "VLSI model of computation" recently proposed [14,18,19].
In this model, each wire has unit width on the silicon chip and transmits
a unit of information in a unit of time; information is taken from, or
delivered to, special areas on the chip, called nexuses, each associated
with a module. Within this model, which takes realistic account of the
placement of modules and interconnection, C. D. Thompson has studied the
implementation of the Fast-Fourier-Transform [18] and has elucidated
significant relationships between input size n, chip area A, processing
time T, and the so-called minimal bisection width $\omega$. (1) Thompson
has shown that $A \ge \omega^2/4$ in general, and that, for the n-point
FFT, $T \ge n/2\omega$, thus establishing the bound $AT^2 \ge n^2/16$.
The lower bound for time applies to a wider class of problems, as shown
by the following proposition which we state without proof:

Proposition: In the VLSI model (Thompson [18]), time $T \ge n/2\omega$
is required to merge two sorted sequences of length n/2, or to realize
the data rearrangement specified by some permutation drawn from a
transitive group of permutations. (2)

As a consequence, we have $AT^2 \ge n^2/16$ for all such problems.

(1) For a graph G = (V,E) the minimal bisection width $\omega$ is defined
as the smallest integer such that
$\omega = |\{(u,v) \in E : u \in V_1, v \in V_2\}|$, where
$\{V_1, V_2\}$ is a partition of V with $|V_1| \le |V_2| \le |V_1| + 1$.

(2) A subgroup G of the symmetric group $S_n$ is said to be transitive if
$\forall i,j$, $1 \le i,j \le n$, $\exists \sigma \in G : \sigma(i) = j$,
meaning that data located in any position of the machine may be moved
into any other position of the machine.

With the CCC, we have shown that operations such as FFT, merging,
cyclic shifts, shuffles, etc., are all realizable in the minimal
achievable time T = O(log n). We now demonstrate that
$A = O(n^2/\log^2 n)$, thus achieving the lower bound exactly; this means
that the CCC is optimal in the VLSI model for FFT, merging of sorted
sequences, and realization of permutations drawn from a transitive group.
In contrast, known layouts for the k-cube or the perfect shuffle have
area of a larger order.

To achieve $A = O((n/\log n)^2)$ for the CCC, consider a layout which
uses two sheaves of evenly spaced wires, horizontal and vertical, used
respectively for cube and cycle connections. Figure 7 pictorially
provides base, inductive hypothesis, and extension, to prove that an
$n = s \cdot 2^s$ module CCC can be placed on a
$2^s \times (2 \cdot 2^s - 1)$ chip; since
$s \approx \log_2(n/\log_2 n)$, the chip size is about
$(n/\log_2 n) \times (2n/\log_2 n - 1) = O((n/\log n)^2)$. Slightly
more complicated constructions yield somewhat more efficient module
placements as suggested by figure 8.

Figure 8.  A more economical layout for the interconnection of
$4 \cdot 2^4$ modules.


For pedagogical reasons, the CCC introduced so far has a number
$n = s \cdot 2^s$ of processing modules with $s = 2^r$ a power of 2. A
more general version of the CCC can be designed, comprising
$n = h \cdot 2^s$ modules. Each of the $2^s$ cycles of the machine has
$h \ge s$ modules. The lower $s \times 2^s$ modules of the cycles exhibit
the horizontal interconnection of standard CCC, while the
$(h-s) \times 2^s$ higher modules only have vertical (cycle) connections,
as indicated in figure 9. Such a layout has height $2^s + h - s$ and
width $2^{s+1}$ (in unit wire width). The programs presented in
section 4 can be adapted to run on such a machine by simply ignoring
operations pertaining to non-existing horizontal (external) links, and
their running time is proportional to the cycle length h. We see that,
for any value of h satisfying $\log_2 n \le h \le \sqrt{n}$, the
area × (time)^2 product

$$AT^2 = \left(\frac{n}{h} + h - \log\frac{n}{h}\right) \times \frac{n}{h} \times h^2 = n^2 + nh^2 - nh\log\frac{n}{h} = O(n^2)$$

meets the optimal theoretical bound, to within a constant factor. Of
particular interest is the choice $h = O(\sqrt{n})$, which leads to a
running time $T = O(\sqrt{n})$ and uses the minimal achievable area
A = O(n).

Figure 9.  A standard layout for an $h \times 2^s$ CCC (h = 6, s = 4).
6.  CONCLUSION

In this paper, we have proposed a structure which can be used for
direct hardware implementation of specific useful algorithms, or, as
suggested in section 4.3, as a general purpose parallel processing
system.

We expect the CCC to be practically feasible in the present state of
the technology, and to be capable of executing efficiently a wide variety
of algorithms. The extent of the class of algorithms amenable to efficient
CCC processing is not yet well understood, but it goes beyond the
applications described in Section 2; in particular, it includes a variety
of matrix and graph algorithms, as well as arithmetic and algebraic
problems.

Another salient feature of this work is the possibility which appears
to exist of developing a high level, general purpose language for parallel
programming, which would nevertheless be automatically compilable on
systems such as the CCC.